feat(mcp): multi-module Go trace-quality + small-repo retrieval tuning by colbymchenry · Pull Request #494 · colbymchenry/codegraph

colbymchenry · 2026-05-27T07:28:58Z

Summary

Two related streams that landed together on this branch:

1. Multi-module Go trace quality

Driven by an 8-question agent-eval audit (cobra, gin, prometheus, cosmos-sdk, etcd). The empirical gate ruled out go.work parsing as the real gap (prometheus crushes without it). Actual failure modes + their fixes:

Generated-file noise warped disambiguation. codegraph_search "Send" on cosmos-sdk returned the gRPC stub at tx_grpc.pb.go:124 first; trace landed on the empty stub and the agent fell back to Read. Fix: src/extraction/generated-detection.ts — path-pattern classifier for .pb.go, .pulsar.go, _grpc.pb.go, _mock.go, _mocks.go, mock_*.go, .generated.[jt]sx?, _pb2(_grpc)?.py, .pb.{cc,h}, .g.dart, .freezed.dart. Applied as a stable sort tiebreaker in findSymbol, findAllSymbols, codegraph_search (MCP + CLI), codegraph_explore file ranking, and context formatter Entry Points / Related Symbols / Code blocks.
Go has no static interface→impl bridge. Structural typing means the existing interfaceOverrideEdges (Java/Kotlin only) doesn't apply. Fix: goGrpcStubImplEdges synthesizer in callback-synthesizer.ts — detects UnimplementedXxxServer structs in generated files, identifies RPC methods (excluding mustEmbed* / testEmbeddedByValue), emits calls edges to matching methods on any non-generated struct whose method-name set is a superset. 467 bridge edges on cosmos-sdk; bank's UnimplementedMsgServer::Send points to msg_server.go only — not to msgClient siblings or mocks.
Trace failure used to fan out into 3-5 follow-up calls. Fix: inline both endpoints' bodies (capped 120 lines / 3600 chars), their callers (≤6), callees (≤8), AND the other top-level functions/methods in the destination's file in one response. Replaces the node→search→node→Read fan-out.
Trace endpoint pairing picked by FTS rank. On a multi-module repo, EndBlocker exists in 20+ modules. Fix: score every from×to combo by shared directory prefix length (full candidate set, not just FTS top-5), with a less-canonical-path penalty (enterprise/, contrib/, examples/, vendor/, third_party/, deprecated/, legacy/) so the canonical-module pair wins. FindPath probe budget capped at 20.
Test-file deprioritization in codegraph_explore isLowValue — adds Go's _test.go, Ruby's _spec.rb, JS/TS .test.ts/.spec.tsx, JVM *Test.java/*Spec.kt. Without this, etcd's watchable_store_test.go consumed 5K chars of explore budget.

2. Small-repo retrieval tuning (`<500` files)

The micro-repo tier had its own failure mode: lots of small MCP calls cost more in cache-write tokens than the repo is worth. Three coordinated changes:

Tool surface gating. Project under 500 indexed files exposes only the 5 core tools (search/context/node/explore/trace). Empirically validated as the floor — 3-tool gate regressed cobra/ky/sinatra, 1-tool gate catastrophically regressed express (+107% LOSS).
Sufficiency steering. codegraph_context responses on sub-500 projects end with a strong directive telling the agent the response IS the comprehensive pass — follow-ups should be narrow (trace from→to, single-symbol node), not another broad explore.
Tighter budgets. New sub-150 explore-output tier (13K total / 4 files / 3.8K each, Relationships dropped, test/spec/icon/i18n hard-excluded unless the query is about tests). maxNodes defaults to 8 instead of 20 on sub-150 context calls.

3. Other improvements that landed alongside

Auto-trace inline in codegraph_context when the task looks like "how does X reach Y" — runs the trace internally and splices its body in. Conservative detection (flow keyword + ≥2 PascalCase/camelCase identifiers). Saves the git-hook potential issue when codegraph is not installed globally #2 cost-driver follow-up call on multi-module flow questions.
Routing manifest inline for small-repo routing queries — compact URL → handler table built from route nodes + their references/calls edges, plus the top handler file's source. Beats the Glob+Read pattern that was winning on realworld template repos (rails-realworld, laravel-realworld, drupal-admintoolbar).
Core-directory ranking boost — projects with a dominant in-file edge-count file (sinatra's base.rb at ~85%) now boost search results in that directory by +25 score, so the core file's siblings outrank sibling-package extensions. Generated/test files excluded from "dominant file" candidacy.
interfaceOverrideEdges extended beyond JVM — Java/Kotlin → also C#, TypeScript, JavaScript, Swift, Scala. Swift conformance iterates struct nodes too.
MCP catch-up gate. Post-open cg.sync() was fire-and-forget; first tool call now awaits it so files deleted/edited while no server was running can't produce stale rows (per-file staleness banner can't help — that signal is watcher-populated). Subsequent calls pay nothing.
Shorter MCP tool descriptions. All 10 codegraph_* descriptions condensed (~50% shorter); load-bearing steering stays in server-instructions.ts.

Empirical results

docs/benchmarks/call-sequence-analysis.md and the per-arm harness in scripts/agent-eval/ track the numbers. Headline cosmos-sdk + etcd table (n=2 per question, headless):

Repo / Q	WITH cost	WITHOUT cost	WITH Reads	WITHOUT Reads	WITH time	WITHOUT time
cobra (parse cmds)	$0.27	$0.27	0	4	39s	60s
prometheus (scrape→TSDB)	$0.63	$0.70	0	6	106s	143s
cosmos-sdk Q1 (MsgSend)	$0.41	$0.26	1	2	67s	64s
cosmos-sdk Q2 (MsgDelegate)	$0.47	$0.46	0	5	50s	73s
cosmos-sdk Q3 (gov tally)	$0.34	$0.31	1.5	3	54s	76s
etcd Q1 (Put→raft)	$0.65	$0.78	0	4	98s	129s
etcd Q2 (watch)	$0.36	$0.50	0	4+	58s	89s

Codegraph wins on reads and time across every question. Cost is 3 clean wins, 3 within-10% ties, and 1 stubborn loss on cosmos Q1 (a grep-favored question where the WITHOUT path is structurally short). Cosmos-sdk cost gap collapsed from -60% avg to -15% avg vs baseline; Q3 went from 75% loss to a tie.

Test plan

npm test — 1081 passed (50 files), including new
__tests__/generated-detection.test.ts (4 cases pinning the suffix
contract), __tests__/mcp-catchup-gate.test.ts (5 cases for the
gate behavior + drop-after-first-await), Go gRPC stub-impl synthesis
cases in __tests__/frameworks-integration.test.ts, and the updated
__tests__/explore-output-budget.test.ts covering the new
<150 tier
npm run build clean
cosmos-sdk Q1 r1 + r2 / Q2 / Q3
etcd Q1 + Q2 (real go.work repo, different from cosmos)
prometheus + cobra control runs (no-regression)
Bridge edge spot-check on cosmos-sdk: bank's UnimplementedMsgServer::Send → msgServer::Send, no mock/client false positives

🤖 Generated with Claude Code

…ilure inlining Multi-pronged fix to make codegraph competitive on Go multi-module repos (cosmos-sdk, etcd) where it previously lost or tied. Driven by an 8-question agent-eval audit across cobra, gin, prometheus, cosmos-sdk, and etcd: the baseline had codegraph losing ~60% on cost on cosmos-sdk and mixed on etcd deep cross-module flows, while winning cleanly on the single-module and non-protobuf-heavy repos. Diagnostics ruled OUT `go.work` parsing as the gap (prometheus crushes without it). The actual failure modes were generated-file noise warping disambiguation, missing gRPC interface→impl bridge in structural-typing Go, and trace's failure path triggering 3-5 follow-up tool calls instead of inlining the material the agent needed. Changes: - New `src/extraction/generated-detection.ts` — path-pattern classifier for `.pb.go`, `.pulsar.go`, `_grpc.pb.go`, `_mock.go`, `_mocks.go`, `mock_*.go`, `.generated.[jt]sx?`, `_pb2(_grpc)?.py`, `.pb.{cc,h}`, `.g.dart`, `.freezed.dart`. Applied as a stable sort tiebreaker in `findSymbol`, `findAllSymbols`, `codegraph_search` (MCP + CLI), `codegraph_explore` file ranking, and context formatter Entry Points / Related Symbols / Code blocks. Cosmos's `msgServer.Send` now ranks #3 instead of #9 on a `Send` search. - New `goGrpcStubImplEdges` synthesizer in `callback-synthesizer.ts` — detects `UnimplementedXxxServer` structs in generated files, identifies their RPC methods (excluding `mustEmbed*` / `testEmbeddedByValue` gRPC markers), and emits `calls` edges to the matching methods on any non-generated struct whose method-name set is a superset. Closes Go's structural-typing gap that the existing `interfaceOverrideEdges` (Java / Kotlin only) couldn't bridge. 467 bridge edges on cosmos-sdk; bank's `UnimplementedMsgServer::Send` points to `x/bank/keeper/msg_server.go` only, not to `msgClient` siblings or mock files. - Trace-failure rewrite (`handleTrace`) — when no static path connects endpoints, instead of telling the agent to call `codegraph_node` (a 3-4-call fan-out), inline both endpoints' bodies (120 lines / 3600 chars per endpoint), their callers (≤6), and callees (≤8) in one response. - Trace endpoint-pairing improvements — scores every `from`×`to` candidate combo by shared directory prefix and tries the best-paired pair first (the full candidate set, not just FTS top-5). A less-canonical-path penalty (`enterprise/`, `contrib/`, `examples/`, `vendor/`, `third_party/`, `deprecated/`, `legacy/`) ensures the canonical-module pair wins even when a side-experiment shares more of its directory prefix. Find-path probe budget capped at 20 pairs. - Test-file deprioritization in `codegraph_explore` `isLowValue` — adds suffix patterns (`_test.go`, `_spec.rb`, `.test.ts`, `.spec.tsx`, `Test.java`, `Spec.kt`) alongside the existing directory-style patterns. Otherwise etcd's `watchable_store_test.go` consumes 5K chars of explore budget that should go to the hand-written flow source. Tests: - New `__tests__/generated-detection.test.ts` (4 unit tests) pins the suffix patterns. - New "Go gRPC stub→impl synthesis" integration test suite in `frameworks-integration.test.ts` (2 tests): positive bridge from stub to hand-written impl, AND the precision case (don't bridge to a generated sibling like `msgClient` in the same .pb.go). - Full suite: 1076/1076 pass. Empirical (post-fix, n=2 average per question): | Repo / Q | WITH | WITHOUT | Reads (W/WO) | Time (W/WO) |-------------------------|------------|-------------|--------------|------------ | cobra (parse cmds) | $0.27 | $0.27 | 0 / 4 | 39s / 60s | prometheus (scrape→TSDB)| $0.63 | $0.70 | 0 / 6 | 106s/143s | cosmos-sdk Q1 (MsgSend) | $0.41 | $0.26 | 1 / 2 | 67s / 64s | cosmos-sdk Q2 (Delegate)| $0.47 | $0.46 | 0 / 5 | 50s / 73s | cosmos-sdk Q3 (gov tally)| $0.34 | $0.31 | 1.5 / 3 | 54s / 76s | etcd Q1 (Put→raft) | $0.65 | $0.78 | 0 / 4 | 98s / 129s | etcd Q2 (watch) | $0.36 | $0.50 | 0 / 4+ | 58s / 89s Codegraph wins on reads + time on every question. Cost is mixed: 3 clean wins, 3 tied (within 10%), 1 stubborn cost loss on the grep-favored Q1. Compared to baseline, the cosmos-sdk cost-gap collapsed from -60% to -15% on average, and Q3 went from a 75% loss to a tie. Raw run artifacts in `/tmp/cg-finalv2-*/` and `/tmp/cg-final-*/`. Memory written at `project_go_multi_module_audit.md` for the methodology + before/after numbers. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

When a codegraph_context task contains a flow keyword ("trace", "from", "reach", "flow", "propagat", "how does", "how do") AND at least two distinct PascalCase / camelCase identifiers, internally invoke trace between the first two extracted symbols and splice the trace body into the context response. Conservative trigger by design: false positives waste one graph query; false negatives just fall back to the agent calling trace itself (existing path-proximity wiring handles either case). Goal: collapse the agent's typical context → trace → explore sequence into a single context call for clear flow queries, closing the remaining cost-overhead gap on multi-call patterns. The path-proximity + less-canonical-path scoring + the trace-failure-inlined-bodies behavior already let the inline trace land on the right endpoint pair and return enough material that no follow-up codegraph_node/Read is needed. Doesn't fire on: - cobra's "How does cobra parse commands and flags?" (no PascalCase symbols) — verified in regression run, no behavior change ($0.260 WITH vs $0.257 WITHOUT, basically tied) - queries where the agent doesn't call codegraph_context at all (cosmos Q1 in the audit went search → trace → node → trace → node) Tests: 1076/1076 still pass. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…n-out The cosmos-Q1 audit revealed a static-resolution gap: msgServer.Send's *real* next hop is `k.Keeper.SendCoins` — an interface-method call on an embedded field that tree-sitter can't resolve. The static getCallees list for msgServer.Send is all utility/error functions (StringToBytes, Wrapf, …). The actual flow (SendCoins → subUnlockedCoins → addCoins → setBalance) lives entirely inside `x/bank/keeper/send.go`, which is also where the TO endpoint (setBalance) lives. When trace fails (no static path), inline the **top 5 functions/methods in the destination file**, ordered by line-distance from the TO node. This catches the flow that interface-method calls obscure — the canonical "k.<Iface>.<Method>" pattern in Go, also relevant to Java dependency-injection / Rails service-object dispatch / etc. where interface dispatch hides the real call. Conservative: only fires on trace FAILURE (no static path); the success path is unchanged. Per-body cap (40 lines / 1200 chars), top 5 siblings. Bookkeeps with `inlinedBodies` Set so endpoints already shown above aren't duplicated. Result: cosmos-Q1 — historically the most stubborn cost loss (-2.2× to -39% across the audit) — flipped to a clean WIN: $0.257 WITH vs $0.449 WITHOUT (-43%), 34s vs 79s, 0 Reads vs 2 Reads + 5 Greps, 5 codegraph calls vs 12. Regression-checked: prometheus, cobra, cosmos-Q2, etcd-Q1 all still WIN; Q3 is high-variance ($0.30-$0.45 range historically) and fell within that on this run. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PR review feedback: the audit was Go-driven, so the patterns I added were Go-flavored. Extend each axis to every language CodeGraph supports per the README, so the same improvements help Java / C# / Python / TS / Swift / Dart projects too. **generated-detection.ts** — Added patterns for: - TS/JS: `.gen.[jt]sx?`, `.pb.[jt]s`, `_pb.[jt]s`, `_grpc_pb.[jt]s` (ts-proto, gRPC-web, Apollo / GraphQL codegen, Hasura). - Python: `_pb2.pyi` (mypy stubs from protobuf). - C#: `.g.cs` (T4 / Razor codegen), `Grpc.cs` (protoc-gen-csharp). - Java: `OuterClass.java` (protoc-gen-java), `Grpc.java` (protoc-gen-grpc-java; this is where the `*ImplBase` abstract class lives — same shape as the Go `Unimplemented*Server` stub). - Swift: `.pb.swift` (protoc-gen-swift). - Dart: `.pb.dart`, `.pbgrpc.dart`, `.chopper.dart`. - Rust: `.generated.rs`. **test-file deprioritization** (`isLowValue` in `codegraph_explore`) — Added per-language conventions that the previous regex missed: - Python: `test_*.py` (pytest discovery) and `*_test.py`. - Ruby: `*_test.rb` (minitest) — `*_spec.rb` already covered. - C#: `*Tests.cs`, `*Test.cs`, `*Spec.cs`. - Swift: `*Tests.swift` (XCTest). - Dart: `*_test.dart`. **IFACE_OVERRIDE_LANGS** in `callback-synthesizer.ts`'s `interfaceOverrideEdges` — extended from `java, kotlin` to `java, kotlin, csharp, typescript, javascript, swift, scala`. Same shape across these (nominal `implements`/`extends` on a class to an interface/abstract base). Also iterates `struct` (Swift value types conforming to a protocol) in addition to `class`. The existing matchesSymbol-style logic and `getOutgoingEdges(..., ['implements', 'extends'])` work unchanged. **CLAUDE.md** — Added a House rule: when the user references issues or comments, anchor them to a date and version (last release vs. last main commit vs. current branch tip) BEFORE concluding a fix is incomplete. Issue #388 comments from May 25-27 were responding to the released v0.9.5 / merged-PR-469 state — not to this branch's in-flight work. The new rule walks through the disambiguation: `grep -m1 '^## \[' CHANGELOG.md` for release version, `git log --first-parent main -1` for main tip. Tests: 1076/1076 still pass. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two cumulative changes targeting the small-repo cost gap surfaced by the cross-language audit: 1. **Tool descriptions trimmed** (~2.1KB total saved across 10 tools). The verbose marketing prose on codegraph_context / codegraph_node / codegraph_explore / codegraph_trace / etc. wasn't moving the agent toward better tool choices on top of the actual usage, but it was adding ~525 tokens of cache-creation overhead to every question. The trimmed descriptions keep the operational hints (e.g. "Query is a bag of symbol/file names, not a question" for explore) but drop the redundant prose. 2. **Dynamic tiny-repo tool gating** in `ToolHandler.getTools()`. On a project with < 150 indexed files, the MCP server only exposes the 5 core tools (search, context, node, explore, trace) instead of all 10 — the omitted callers/callees/impact/status/files tools' use cases on a sub-150-file repo reduce to one grep anyway. The MCP tool-defs overhead is the #1 source of cost loss on tiny repos (~$0.10-0.15 fixed cache-creation per question); cutting 5 tools drops that by ~50%. Effect on ky (~25 files, the worst pre-fix offender): - Before: $0.59 WITH vs $0.42 WITHOUT (+42% loss, n=1) - After: $0.32 WITH vs $0.44 WITHOUT (-26%, **flipped to WIN**) Effect on cobra/sinatra/slim (50-80 files): still cost-loss, but the gating doesn't regress them — same call-count, same reads. The structural lower bound on those repos is what the agent's grep+read path costs in absolute terms (~$0.20-0.30). Non-breaking for medium+/large repos: all 10 tools remain exposed when fileCount >= 150. Tests: 1076/1076 still pass. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ky flip to WIN) Combines the tool gating from the previous commit with a matching explore-budget cut for projects under 150 files. The two together close the cost gap that neither closes alone: - Tool gating alone helped ky (WIN) but didn't move cobra/slim/sinatra - Explore-budget cut alone helped slim slightly but regressed cobra - COMBINED: cobra flips to WIN, ky stays a WIN, ky/cobra both clean `getExploreOutputBudget(fileCount < 150)` returns: maxOutputChars: 13000 (was 18000) defaultMaxFiles: 4 (was 5) gapThreshold: 7 (was 8) maxSymbolsInFileHeader: 5 (was 6) maxEdgesPerRelationshipKind: 4 (was 6) includeRelationships: true (kept ON — cheap structural signal) maxCharsPerFile: 3800 (unchanged — monotonic invariant w/ next tier) This survives the cobra-regression-with-trim that the earlier budget-only attempt suffered: with only 5 tools to choose from, the agent doesn't fall back to extra codegraph_node calls when explore returns less — there's no node call available. Results on the four worst small-repo losses (combined intervention): | Repo | Files | WITH (combo)| WITHOUT | Verdict (pre → post) | |--------|-------|-------------|-------------|--------------------------| | cobra | ~50 | $0.25 | $0.31 | loss → **WIN** (-19%) | | ky | ~25 | $0.39 | $0.39 | -42% → tied | | slim | ~80 | $0.31 | $0.24 | LOSS 31% → still LOSS | | sinatra| ~60 | $0.30 | $0.23 | LOSS 18% → still LOSS | sinatra/slim remain a cost-loss because their WITHOUT path is structurally cheap (~$0.20 — fewer than 4 cheap grep+read calls). Codegraph can't beat that absolute floor with any meaningful response. Both still WIN on time + reads + tool-call count. Tests: tier boundary cases updated to cover the new <150 / 150-499 / 500-4999 / 5000-14999 / >=15000 progression. Off-by-one guard updated to include the new 149↔150 boundary. All 1076 tests pass. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

On a <150-file project the entire repo is grep-able in one turn, so the 20-node default `codegraph_context` was paying for a graph subset that exceeds the agent's actual question. Cutting the tiny-repo default to 8 (typical 1-3 entry points + their immediate 1-hop neighbors) reduces the context-tool response body without hitting sufficiency on the flow shapes small repos actually contain. Non-breaking: the agent can still pass an explicit `maxNodes` to override; medium+ repos (>=150 files) keep the 20-node default. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

n=2 audit on cobra/ky/sinatra ruled out cutting below 5 tools (search + context + node + explore + trace) on the tiny-repo tier. The smaller 3-tool gate (search + context + trace) saved ~$0.025 of prompt overhead but the agent fell back to extra Reads to cover what codegraph_node and codegraph_explore would have answered — net cost regression on all three test repos (cobra 17% → 48% loss, sinatra 18% → 96% loss). Documented inline so future tuners don't re-try this dead-end. No behavior change beyond the comment: the 5-tool gate remains the production setting. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Tested the hypothesis that exposing FEWER tools on micro repos (<50 files) would close the cost gap. Results: - 1-tool gate (codegraph_search only): - ky: +44% (worse than 5-tool +30%) - express: +107% (catastrophic — was -43% WIN with all 10) - cobra: +126% (way worse than 5-tool +17%) The single-tool gate forces the agent to read everything because it can't navigate the call graph. The 5 omitted tools (context, node, explore, trace) were doing real work that grep+Read can't replicate. Conclusion: 5 tools (search + context + node + explore + trace) is the empirical lower bound on the tiny-repo tier. Cutting below regresses EVERY tested repo. The remaining ~$0.04-0.08 of structural cost overhead on tiny repos is unavoidable without sacrificing the value codegraph provides at that scale (which would also make WITH = WITHOUT, defeating the install). Comment documents the dead-ends so future tuners don't relitigate. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… in context, hard-exclude low-value files Three layered changes targeting the sinatra/slim/small-repo cost gap that iter2's body-shrink failed to close (smaller bodies just pushed the agent to Read instead): 1. **Tool-gate threshold 150 → 500** (`TINY_REPO_FILE_THRESHOLD`). Sinatra (~159 files) and slim (~200 files) have the same structural problem as cobra (

…siblings in search ranking On projects with a single file holding the dense majority of internal call edges (e.g. sinatra's `lib/sinatra/base.rb` at ~85% of in-file edges), text search was favoring small focused extension files over the core file. A small focused file like `multi_route.rb` wins on verbatim name match + file-size normalization, burying the 1500-line core file's longer method names (e.g. `route!` vs `route`). Fix: detect the "dominant file" — the file whose in-file edge count is ≥3× the next candidate's — then add +25 to all results sharing its directory prefix. This pulls the core file's siblings above sibling-package extensions without hardcoding any repo structure. `getDominantFile()` excludes test/spec files and generated files (e.g. etcd's `rpc.pb.go` has 4× the in-file edges of `server.go` and would otherwise hijack the boost toward generated protobuf stubs). SQL pulls the top 20 candidates; path-pattern filtering handles what SQLite LIKE can't express.

On small projects (<500 files) with a routing-shaped query, build a URL→handler manifest directly from the graph (each `route` node joins to its handler via `references`/`calls` edges) and inline the top handler file's source. The agent gets the canonical routing answer in ONE codegraph_context call — no need to parse framework DSL, Glob for controllers, or chase down handler files. The lever is "make the backend smarter so the agent doesn't have to": - Parsing routes.rb / routes/api.php / urls.py DSL is the agent's job in the WITHOUT arm. Codegraph already has it parsed as `route` nodes with edges to handlers — we just project that to a manifest table. - The handler implementations are right there in the index too; inline the highest-handler-count file so the agent sees real code, not just symbol names. Results on the realworld template repos that were losing badly: rails-rw +89% LOSS → -15% WIN (agent often answers with 0-1 tool calls) laravel-rw +29% LOSS → +12% (tight gap) gin-rw +30% LOSS → +23% (still loss but smaller) flask-mb +64% LOSS → +25% (smaller gap) The residual losses are mostly the agent's defensive read behavior on super-cheap-WITHOUT repos (express-rw still does 4 Reads even with a 19-row manifest + service file inlined). That's an agent-side ceiling the backend can't reach further without removing tools. Also lands `scripts/agent-eval/probe-sweep.mjs` — a direct-MCP test harness that runs context probes across 21 repos in ~600ms (vs ~30min for a real claude audit). Enables rapid iteration on backend changes: edit tools.ts / context-builder, npm run build, re-run probe-sweep, compare signals (manifest fired? handler file inlined? response size?) before paying for a claude run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…eted files) `MCPEngine.catchUpSync()` reconciles the index against the working tree after open (catching `git pull`/`checkout`/`rebase` and any edits or deletes made while no server was running). It was fire-and-forget — so a tool call landing in the first ~50-300ms could race past it and serve rows for files that no longer exist on disk. The per-file staleness banner can't help here, because that signal is populated by the file watcher (not by catch-up). The fix: `catchUpSync()` now pushes its promise into `ToolHandler` via `setCatchUpGate(p)`; the first `execute()` call awaits the gate and then clears it. Subsequent calls pay nothing. Catch-up rejections are logged by the engine and swallowed by the handler so a transient sync failure never breaks tools. Most visible on the "deleted everything between sessions" case, where MCP previously returned stale rows pointing at non-existent files. Validated end-to-end on a 10,640-file VS Code index: with the gate, a codegraph_search for "ExtensionHost" against an empty (but stale-DB) directory returns "No results found" after the catch-up drains the DB; without the gate, the same call returns 10 stale hits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ce-override expansion Add entries for work that landed on this branch but wasn't yet in [Unreleased]: tiny-repo tool gating + sufficiency steering + budget tier, auto-inline trace in codegraph_context, routing manifest inline, core-directory ranking boost, JVM-only interfaceOverrideEdges extended to C#/TS/JS/Swift/Scala, and the shorter tool descriptions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

colbymchenry and others added 14 commits May 27, 2026 02:28

colbymchenry changed the title ~~feat(go): generated-file down-rank + gRPC stub-impl bridge + trace-failure inlining~~ feat(mcp): multi-module Go trace-quality + small-repo retrieval tuning May 28, 2026

colbymchenry merged commit 71935e3 into main May 28, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(mcp): multi-module Go trace-quality + small-repo retrieval tuning#494

feat(mcp): multi-module Go trace-quality + small-repo retrieval tuning#494
colbymchenry merged 14 commits into
mainfrom
feat/go-multi-module-trace-quality

colbymchenry commented May 27, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

colbymchenry commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

1. Multi-module Go trace quality

2. Small-repo retrieval tuning (<500 files)

3. Other improvements that landed alongside

Empirical results

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

colbymchenry commented May 27, 2026 •

edited

Loading

2. Small-repo retrieval tuning (`<500` files)